
Enable Ascend NPU Backend with Custom Ops Integration for NF4 Support#1695

Open
SlightwindSec wants to merge 12 commits into bitsandbytes-foundation:main from SlightwindSec:upstream_main_npu_enabled

Conversation


@SlightwindSec SlightwindSec commented Jul 2, 2025

What does this PR do?

This PR ports Ascend NPU backend changes from the multi-backend-refactor branch and integrates with custom ops. It includes changes to enable Ascend build and translation of kernels and ops to Ascend-compatible operators. As the AscendC-based high-performance NF4 implementation is still in progress, a temporary PyTorch version is used for now. The build steps remain the same as before from the user's standpoint.

Collaborators

@ji-huazhong @Ginray @Runningwater23 @jiaqiw09
cc @Titus-von-Koeller @matthewdouglas @amathews-amd @sunway513

@SlightwindSec SlightwindSec changed the title Enable Ascend NPU Backend with Custom Ops Integration for NF4 Support [WIP] Enable Ascend NPU Backend with Custom Ops Integration for NF4 Support Jul 4, 2025
@matthewdouglas matthewdouglas self-requested a review July 8, 2025 16:41
@matthewdouglas matthewdouglas self-assigned this Jul 8, 2025
@matthewdouglas matthewdouglas added the Ascend NPU Related to Ascend NPU backend label Jul 8, 2025

unlizi commented Aug 7, 2025

Error Summary:

Encountered a vector core execution failure on an Ascend 910B3 NPU while running Qwen-Image NF4 quantized model inference. The NPU reported multiple DDR memory access violations (error code 0x800000) across 12 compute cores, specifically during dequantize_blockwise_fp32_nf4_1 kernel execution. The system then threw an ACL synchronization error (code 507035) when attempting a tensor device transfer (pos_freqs.to(device)).

Technical Breakdown:

Hardware level: multiple cores (5-15, 20-22) triggered MTE (Memory Transfer Engine) faults indicating invalid DDR address access.
Software stack:
Framework: PyTorch 2.6.1 + torch-npu 2.6.1
CANN: 8.1.RC1
Failing operation: NF4 dequantization kernel
Error progression:
Initial memory range violation → vector core exceptions → stream synchronization failure.
The error suggests a possible memory alignment issue or an NPU microarchitecture incompatibility with the quantization pattern.

(MindSpore) [ma-user qwen-image]$python nf4_inference.py
/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/torch_npu/utils/collect_env.py:58: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/latest owner does not match the current owner.
warnings.warn(f"Warning: The {path} owner does not match the current owner.")
/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/torch_npu/utils/collect_env.py:58: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/8.1.RC1/aarch64-linux/ascend_toolkit_install.info owner does not match the current owner.
warnings.warn(f"Warning: The {path} owner does not match the current owner.")
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
Loading pipeline components...: 0%| | 0/5 [00:00<?, ?it/s]/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/torch_npu/utils/storage.py:38: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
if self.device.type != 'cpu':
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:04<00:00, 2.30s/it]
Loading pipeline components...: 40%|████████████████████████████████████████▍ | 2/5 [00:20<00:25, 8.42s/it]The config attributes {'pooled_projection_dim': 768} were passed to QwenImageTransformer2DModel, but are not expected and will be ignored. Please verify your config.json configuration file.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:40<00:00, 13.49s/it]
Loading pipeline components...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [01:03<00:00, 12.79s/it]
0%| | 0/50 [00:00<?, ?it/s][W807 18:53:28.145560476 compiler_depend.ts:57] Warning: EZ9999: Inner Error!
EZ9999: [PID: 624026] 2025-08-07-18:53:28.091.327 The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 5, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x5c00003ff6, mte error info: 0x200600006b, ifu error info: 0x2000111ee1ac0, ccu error info: 0xe150ca367a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
TraceBack (most recent call last):
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:0, tslot:4, thread:0, ctxid:0, blk:26, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 10, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x6a00000075, mte error info: 0x200600006b, ifu error info: 0x2000111e36a00, ccu error info: 0x7836503b7a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:0, tslot:4, thread:0, ctxid:0, blk:31, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 11, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x5c00003fc9, mte error info: 0x200600006b, ifu error info: 0x2000111efbc00, ccu error info: 0x7836503b7a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:0, tslot:4, thread:0, ctxid:0, blk:32, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 12, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x7300000064, mte error info: 0x200600006b, ifu error info: 0x2000111e75500, ccu error info: 0xc27201af7a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:0, tslot:4, thread:0, ctxid:0, blk:33, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 15, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x680000016f, mte error info: 0x200600006b, ifu error info: 0x2000111ee30c0, ccu error info: 0x8a6d50d07a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:0, tslot:4, thread:0, ctxid:0, blk:36, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 20, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x680000016f, mte error info: 0x200600006b, ifu error info: 0x2000111edab40, ccu error info: 0x7629795c7a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:0, tslot:4, thread:0, ctxid:0, blk:37, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 21, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x5c00003fc9, mte error info: 0x200600006b, ifu error info: 0x2000111ed4ac0, ccu error info: 0x847cc0897a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:0, tslot:4, thread:0, ctxid:0, blk:38, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 22, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x730000007e, mte error info: 0x200600006b, ifu error info: 0x2000111d982c0, ccu error info: 0x7836503b1a8000de, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:0, tslot:4, thread:0, ctxid:0, blk:39, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 6, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x5c00003fc9, mte error info: 0x200600006b, ifu error info: 0x2000111e40d40, ccu error info: 0x127070cc2b0000b1, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:1, tslot:4, thread:0, ctxid:0, blk:27, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 7, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x5c00003ff6, mte error info: 0x200600006b, ifu error info: 0x2000111eb3600, ccu error info: 0xd7a244127a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:1, tslot:4, thread:0, ctxid:0, blk:28, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 8, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x6a00000075, mte error info: 0x200600006b, ifu error info: 0x2000111eb8e00, ccu error info: 0x6cf8f06c7a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:1, tslot:4, thread:0, ctxid:0, blk:29, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 9, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x6a00000075, mte error info: 0x200600006b, ifu error info: 0x2000111eaef80, ccu error info: 0x7629795c7a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:1, tslot:4, thread:0, ctxid:0, blk:30, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 13, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x5c00003ff6, mte error info: 0x200600006b, ifu error info: 0x2000111eabb80, ccu error info: 0x6c13551b7a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:1, tslot:4, thread:0, ctxid:0, blk:34, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 14, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x5c00003fc9, mte error info: 0x200600006b, ifu error info: 0x2000111ed8900, ccu error info: 0x244388b7a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:1, tslot:4, thread:0, ctxid:0, blk:35, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1366]
AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1119]
Aicore kernel execute failed, device_id=0, stream_id=2, report_stream_id=2, task_id=6348, flip_num=0, fault kernel_name=dequantize_blockwise_fp32_nf4_1, fault kernel info ext=none, program id=0, hash=12067931037022988496.[FUNC:GetError][FILE:stream.cc][LINE:1119]
[AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1119]
rtStreamSynchronize execute failed, reason=[vector core exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
(function copy_between_host_and_device_opapi)
0%| | 0/50 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/home/ma-user/work/qwen-image/nf4_inference.py", line 45, in <module>
image = pipe(
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/diffusers/pipelines/qwenimage/pipeline_qwenimage.py", line 655, in __call__
noise_pred = self.transformer(
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/diffusers/models/transformers/transformer_qwenimage.py", line 594, in forward
image_rotary_emb = self.pos_embed(img_shapes, txt_seq_lens, device=hidden_states.device)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/diffusers/models/transformers/transformer_qwenimage.py", line 202, in forward
self.pos_freqs = self.pos_freqs.to(device)
RuntimeError: ACL stream synchronize failed, error code:507035
[W807 18:53:28.156241503 compiler_depend.ts:526] Warning: NPU warning, error code is 507035[Error]:
[Error]: The vector core execution is abnormal.
Rectify the fault based on the error information in the ascend log.
EH9999: Inner Error!
rtDeviceSynchronizeWithTimeout execute failed, reason=[vector core exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
EH9999: [PID: 624026] 2025-08-07-18:53:28.111.927 wait for compute device to finish failed, runtime result = 507035.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
TraceBack (most recent call last):
(function npuSynchronizeUsedDevices)
[W807 18:53:28.157702343 compiler_depend.ts:508] Warning: NPU warning, error code is 507035[Error]:
[Error]: The vector core execution is abnormal.
Rectify the fault based on the error information in the ascend log.
EH9999: Inner Error!
rtDeviceSynchronizeWithTimeout execute failed, reason=[vector core exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
EH9999: [PID: 624026] 2025-08-07-18:53:28.113.595 wait for compute device to finish failed, runtime result = 507035.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
TraceBack (most recent call last):
(function npuSynchronizeDevice)
[W807 18:53:28.158988663 compiler_depend.ts:151] Warning: NPU warning, error code is 507035[Error]:
[Error]: The vector core execution is abnormal.
Rectify the fault based on the error information in the ascend log.
EH9999: Inner Error!
rtDeviceSynchronizeWithTimeout execute failed, reason=[vector core exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
EH9999: [PID: 624026] 2025-08-07-18:53:28.114.891 wait for compute device to finish failed, runtime result = 507035.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
TraceBack (most recent call last):
(function empty_cache)
[ERROR] 2025-08-07-18:53:33 (PID:624026, Device:0, RankID:-1) ERR99999 UNKNOWN applicaiton exception

@SlightwindSec
Author

Hi @unlizi, thanks for the report.

I've pushed a fix for the issue you reported. Big thanks to @jiaqiw09 for helping debug and solve this.

Could you please try re-running the validation?


jiaqiw09 commented Nov 7, 2025

@SlightwindSec Thanks for your fix. I ran several test cases, and they all passed without issues.

Environment:

  • Ascend NPU 2025RC3 HDK driver and CANN
  • Python 3.10
  • Torch / Torch-NPU 2.5.1

Inference

I tested both LLaMA3-8B and Qwen3-4B models.
Inference works correctly.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from bitsandbytes.functional import dequantize_4bit
import torch

quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

model_id = "./llama3_8b"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="npu:0",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    quantization_config=quantization_config,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training

For training, I used the LLaMA-Factory framework to perform QLoRA fine-tuning with DDP and DeepSpeed ZeRO-3.
Everything worked as expected.

You can reproduce the setup using the script below:
examples/train_qlora/llama3_lora_sft_bnb_npu.yaml
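For readers without that file at hand, a config along these lines exercises the same path. The key names follow LLaMA-Factory's published example configs, but every value below is an illustrative placeholder, not the contents of the referenced file:

```yaml
# Hypothetical sketch of a LLaMA-Factory QLoRA SFT config; values are placeholders.
model_name_or_path: ./llama3_8b
stage: sft
do_train: true
finetuning_type: lora
quantization_bit: 4
quantization_method: bitsandbytes
dataset: identity
template: llama3
output_dir: saves/llama3-8b/qlora-sft
per_device_train_batch_size: 1
deepspeed: examples/deepspeed/ds_z3_config.json
```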

@SlightwindSec SlightwindSec changed the title [WIP] Enable Ascend NPU Backend with Custom Ops Integration for NF4 Support Enable Ascend NPU Backend with Custom Ops Integration for NF4 Support Nov 10, 2025
@jiaqiw09

@SlightwindSec @matthewdouglas I have updated the installation docs and the README. Would you mind taking a look?


jiaqiw09 commented Nov 14, 2025

@matthewdouglas Hi, the conflict in the CMakeLists.txt file has been resolved. Would you mind reviewing the changes and sharing any feedback or suggestions? As mentioned in #1545, the multi-backend branch was deprecated, so we are planning to merge Ascend NPU 4-bit support into the main branch.

Collaborator

@TimDettmers TimDettmers left a comment


PR Review: #1695 — Enable Ascend NPU Backend with Custom Ops Integration for NF4 Support

Classification: [feature] [platform] [build] — New NPU backend with native AscendC kernels, build system changes, and device dispatch integration.

Size: 562 additions, 5 deletions across 12 files. Medium-large, new backend PR with new files.

This PR adds an Ascend NPU backend supporting NF4 4-bit quantization and dequantization. It includes a Python-side quantize_4bit implementation, native AscendC dequantization kernels (npu_kernels.cpp), CMake build integration, and cextension.py/__init__.py wiring. The scope is focused on NF4 only (no FP4, no 8-bit ops, no optimizers).

Blocking issues (4):

  1. cextension.py: NPU detection takes priority over CUDA — In the BNB_BACKEND detection chain, is_npu_available() is inserted before torch.cuda.is_available(). Since is_npu_available() only checks whether the torch_npu package is installed (via importlib.util.find_spec), NOT whether an actual NPU device is present, any system that has torch_npu installed alongside CUDA will incorrectly set BNB_BACKEND = "NPU" instead of "CUDA". The XPU backend checks torch._C._has_xpu (actual device presence); the HPU backend checks torch.hpu.is_available(). The NPU backend needs an equivalent device-availability check, not just a package-installed check. Similarly, in get_native_library(), the NPU path would be selected over XPU when both packages are installed.

  2. cextension.py: Broken if/elif chain in get_native_library() — The original code has two separate if blocks: if cuda_specs: ... binary_path = cuda_binary_path followed by a separate if torch._C._has_xpu: .... The PR converts the second block's preceding blank line into an elif is_npu_available(): clause, making the XPU check a separate if that runs unconditionally. The result: on XPU systems with torch_npu installed, the NPU path would be set, then immediately overwritten by the XPU path. But on NPU systems without XPU, the flow is correct only by coincidence. The library loading logic needs proper if/elif/elif chaining across all backends.

  3. npu_ops.cpp: Memory leak — tilingHost is allocated with malloc() but never freed. Every call to dequantizeBlockwiseNf4() leaks sizeof(BlockwiseNf4TilingData) bytes. This is called on every dequantization operation (every Linear4bit forward pass), so the leak accumulates during inference and training.

  4. npu_ops.cpp: Hardcoded core count — coreNum is hardcoded to 40 in both the tiling function and the kernel launch blockDim. This is specific to one Ascend chip variant (e.g., 910B). Different Ascend models have different core counts. This should either be queried at runtime or made configurable.
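A minimal sketch of the chain shape that issues 1 and 2 call for. The boolean flags below are hypothetical stand-ins for the real checks (torch.version.hip, torch.cuda.is_available(), a true NPU device-presence test rather than a package-installed test, torch._C._has_xpu); only the priority ordering and the single if/elif chain are the point:

```python
def select_backend(has_hip=False, has_cuda=False,
                   npu_device_present=False, xpu_device_present=False):
    """Pick one backend; stand-in flags model the real runtime checks."""
    if has_hip:
        return "ROCm"
    elif has_cuda:
        return "CUDA"          # CUDA outranks NPU when both stacks are installed
    elif npu_device_present:   # device present, not merely torch_npu importable
        return "NPU"
    elif xpu_device_present:   # stays in the same chain, never a separate `if`
        return "XPU"
    return "CPU"

print(select_backend(has_cuda=True, npu_device_present=True))  # CUDA
```

With a single chain, a CUDA box that happens to have torch_npu installed can never be reclassified as NPU, and the XPU branch can no longer overwrite an earlier NPU selection.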

Additional issues (non-blocking but significant):

  1. __init__.py: Redundant and insufficient availability check — The import guard if importlib.util.find_spec("torch") and importlib.util.find_spec("torch_npu") checks for package installation, not device availability. The find_spec("torch") check is redundant since torch is already imported at the top of the file. Other backends use device-availability checks (e.g., torch.xpu.is_available(), torch.hpu.is_available()). The NPU backend should check for actual NPU device availability, e.g., if importlib.util.find_spec("torch_npu"): import torch_npu; if torch.npu.is_available(): ... (following the HPU pattern).

  2. quantize_4bit is pure PyTorch, not a native kernel — The NPU quantize_4bit implementation is a pure Python/PyTorch implementation, while dequantize_4bit uses native AscendC kernels. This asymmetry means quantization may be slow. The PR description acknowledges this ("temporary PyTorch version"), but it should be documented with a TODO in code.

  3. quantize_4bit mutates global state — global _NF4_QUANT_TABLE; _NF4_QUANT_TABLE = _NF4_QUANT_TABLE.to(A.device) modifies the shared module-level _NF4_QUANT_TABLE from backends/utils.py. This is not thread-safe and would break if multiple devices are used simultaneously. Other backends using this table should be checked for the same pattern.

  4. NF4 packing order differs from CUDA — The comment in quantize_4bit states "Pack 4-bit values in NPU-compatible order (low nibble first) to match NPU-specific unpacking logic; differs from CUDA's packing." This means NPU-quantized checkpoints are NOT compatible with CUDA-dequantization and vice versa. This is a significant interoperability concern: users who quantize on NPU cannot dequantize on CUDA. This should be prominently documented.

  5. No tests included — The PR adds no test files. For a new backend, there should be at minimum integration tests for quantize_4bit and dequantize_4bit round-trip correctness, and parametrized device tests.

  6. npu_kernels.cpp: Missing newline at end of file.

  7. cextension.py: import importlib placement — The import is added after the from typing import Optional line but before the import torch line, violating the standard library grouping convention (it should be grouped with the other stdlib imports at the top, not after typing).

  8. CMake: npu-smi parsing is fragile — The CMake auto-detection runs npu-smi info | awk ... 'NR==7 {print $3}' which relies on exact output format of npu-smi. Different CANN versions or locales may change the output format. The error message also has a typo: "Auto-detech" should be "Auto-detect".
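As a sketch of the round-trip correctness check that item 5 asks for: the codebook below is the standard 16-entry NF4 table, but the block layout and bit-packing are simplified relative to bitsandbytes, so this is a shape of the test rather than the real one:

```python
# Standard NF4 codebook (normalized quantiles of a Gaussian).
NF4 = [-1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
       -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
       0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
       0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
       0.7229568362236023, 1.0]

def quantize_block(xs):
    """Blockwise absmax scaling, then nearest-codebook index per element."""
    absmax = max(abs(x) for x in xs) or 1.0
    idxs = [min(range(16), key=lambda i: abs(NF4[i] - x / absmax)) for x in xs]
    return idxs, absmax

def dequantize_block(idxs, absmax):
    return [NF4[i] * absmax for i in idxs]

vals = [0.5, -0.25, 0.9, -1.2, 0.01, 0.33, -0.7, 0.0]
idxs, absmax = quantize_block(vals)
recon = dequantize_block(idxs, absmax)
max_err = max(abs(a - b) for a, b in zip(vals, recon))
assert max_err < 0.15 * absmax  # half the widest NF4 gap, scaled by absmax
```

A real test would additionally parametrize over devices and compare NPU output against the CPU reference implementation element-wise.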

Security assessment:

  • No network access, no command execution, no serialization exploits, no obfuscation patterns in the Python code.
  • The native C++ code (npu_ops.cpp, npu_kernels.cpp) uses AscendC SDK APIs (ACLRT_LAUNCH_KERNEL, kernel_operator.h). These are vendor-provided APIs analogous to CUDA runtime APIs. The code does not execute arbitrary commands or access the filesystem.
  • The CMake changes execute npu-smi during build configuration (analogous to rocminfo for ROCm). This is a build-time tool invocation, consistent with existing patterns.
  • pythonInterface.cpp changes are properly guarded behind #if BUILD_NPU preprocessor conditionals.
  • The import importlib addition is benign (already used elsewhere in the codebase).
  • No security concerns identified.

Downstream Impact

Risk level: LOW (but see blocking issue #1)

The PR adds a new backend without changing existing APIs. The matmul_4bit change in _functions.py adds and A.device.type != "npu" to an existing condition, which is additive and follows the same pattern used for HPU.

However, blocking issue #1 (NPU taking priority over CUDA when torch_npu is installed) could silently break CUDA users who happen to have torch_npu installed. This would be a regression for existing users.

  • Transformers: Not affected (additive backend)
  • PEFT: Not affected
  • Accelerate: Not affected
  • TGI: Not affected
  • vLLM: Not affected

Cross-PR Conflicts

  • PR #1853 (Add MPS backend): overlaps on __init__.py, README.md — both add new backend imports and README table entries. File-level conflicts likely.
  • PR #1826 (Metal backend with libtorch): overlaps on cextension.py, __init__.py, CMakeLists.txt — similar new-backend additions.
  • PR #1823 (Add mps backend python only): overlaps on cextension.py, __init__.py.
  • PR #1858 (Add k-bit blockwise quantization): overlaps on CMakeLists.txt, csrc/pythonInterface.cpp.

None of these are semantic conflicts — they are independent backend/feature additions that will need mechanical merge resolution.

Verdict: Request changes. The CUDA-priority regression (issue #1), broken if/elif chain (issue #2), memory leak (issue #3), and hardcoded core count (issue #4) must be fixed before this can merge. The missing tests (additional issue #5) are also a blocking concern for a new backend.

  • Security: Clear
  • Downstream impact: LOW (additive backend), but issue #1 risks CUDA regression
  • Tests: Missing — no tests included for the new backend
  • CI: Not triggered (fork PR, needs maintainer approval)
  • Serialization: NPU packing format differs from CUDA (cross-device checkpoint incompatibility)
  • Cross-PR conflicts: File-level conflicts with 4 open PRs (mechanical resolution)

if torch.version.hip:
HIP_ENVIRONMENT = True
BNB_BACKEND = "ROCm"
elif is_npu_available():
Collaborator

Blocking: NPU detection takes priority over CUDA. is_npu_available() only checks if the torch_npu package is installed (via importlib.util.find_spec), not whether an actual NPU device is present. Since this elif comes before torch.cuda.is_available(), any system with torch_npu installed alongside CUDA will incorrectly set BNB_BACKEND = "NPU". This would silently break all CUDA users who happen to have torch_npu in their environment.

The fix: check for actual device availability (e.g., torch.npu.is_available() after importing torch_npu), and place the NPU check after CUDA in the priority chain, following the pattern of XPU/HPU.

binary_path = cuda_binary_path

elif is_npu_available():
binary_path = PACKAGE_DIR / f"libbitsandbytes_npu{DYNAMIC_LIBRARY_SUFFIX}"
Collaborator

Blocking: Broken if/elif chain. The original code had two independent if blocks here (one for cuda_specs, one for _has_xpu). By inserting elif is_npu_available() between them, the XPU check is now disconnected from the chain. The result is that on a system with both torch_npu and XPU, the NPU binary path is set, then immediately overwritten by the XPU path in the next if. This needs proper if/elif/elif/elif chaining across all backends.

BlockwiseNf4TilingData *tilingHost;
tilingHost = (struct BlockwiseNf4TilingData *)malloc(tilingSize);
uint32_t error = get_dequantize_blockwise_nf4_tiling(blocksize, n, tilingHost);
if (error != 0) {
Collaborator

Blocking: Memory leak. tilingHost is allocated with malloc() but never freed. This leaks memory on every dequantization call (every Linear4bit.forward()). Add free(tilingHost) after the kernel launch, or use stack allocation since BlockwiseNf4TilingData is a small struct.

tiling->ubSize = 196 * 1024;
uint32_t coreNum = 40;
uint32_t totalPkgNum = (n + blocksize - 1) / blocksize;
uint32_t singleCorePkgNum = (totalPkgNum + coreNum - 1) / coreNum;
Collaborator

Blocking: Hardcoded core count. coreNum = 40 is specific to one Ascend chip variant. Different Ascend models have different AI core counts (e.g., Ascend 310 has 2 cores, 910A has 32, 910B has 40). This should be queried from the hardware at runtime or passed as a parameter.
